(54) Singing voice synthesizing apparatus, singing voice synthesizing method, and program for 
realizing singing voice synthesizing method 



(57) A singing voice synthesizing apparatus is pro- 
vided, which enables achievement of a natural sounding 
synthesized singing voice with a good level of compre- 
hensibility. A phoneme database stores a plurality of 
voice fragment data formed of voice fragments each be- 
ing a single phoneme or a phoneme chain of at least 
two concatenated phonemes, each of the plurality of 
voice fragment data comprising data of a deterministic 
component and data of a stochastic component. A readout
device reads out from the phoneme database
the voice fragment data corresponding to inputted lyrics.



A duration time adjusting device adjusts time duration 
of the read-out voice fragment data so as to match a 
desired tempo and manner of singing. An adjusting de- 
vice adjusts the deterministic component and the sto- 
chastic component of the read-out voice fragment so as 
to match a desired pitch. A synthesizing device synthe- 
sizes a singing sound by sequentially concatenating the 
voice fragment data that have been adjusted by the du- 
ration time adjusting device and the adjusting device.

Description

BACKGROUND OF THE INVENTION 
Field of the Invention 

[0001] The present invention relates to a singing voice
synthesizing apparatus that synthesizes a singing 
voice, a method of synthesizing a singing voice, and a 
program for realizing the method thereof. 

Description of the Related Art 

[0002] In the past, there has been a wide range of at- 
tempts to synthesize singing voice. 
[0003] One of these attempts, an application of 
speech synthesis by rule, receives inputs of pitch data, 
which corresponds to the pitch of a note, and of lyric 
data, and synthesizes speech using a synthesis-by-rule 
device for text-to-speech synthesis. In most cases, raw
waveform data or analyzed and parameterized data are
stored in a database in units of phonemes or phoneme 
chains comprised of two or more phonemes. At the time 
of synthesis, required voice fragments (phonemes or 
phoneme chains) are selected, concatenated, and syn- 
thesized. Examples are disclosed in Japanese Laid- 
Open Patent Publications (Kokai) Nos. S62-6299, 
H10-124082, and H11-184490, among others.
[0004] However, since the object of these technolo- 
gies is to synthesize a speaking voice, they are not al- 
ways capable of synthesizing a singing voice with sat- 
isfactory quality. 

[0005] For example, a singing voice synthesized by a 
method of overlapping and adding waveforms as typi- 
fied by PSOLA (Pitch-Synchronous OverLap and Add) 
has a good degree of comprehensibility, but often has 
the problems of unnatural sounding of elongated tones, 
for which the quality of a singing voice varies the great- 
est, and an unnatural sounding synthesized voice when 
there are slight fluctuations of pitch and vibrato, which 
are essential for a singing voice.
[0006] Moreover, attempting to synthesize a singing 
voice using a waveform concatenating type speech syn- 
thesizing device with a large-scale corpus base would 
require an astronomically large number of fragment data 
if the original data are to be concatenated and output 
without any processing. 

[0007] On the other hand, synthesizers whose origi- 
nal purpose is for synthesizing a singing voice have also 
been proposed. A well-known example is the synthesis 
method of formant synthesis (Japanese Laid-Open Pat- 
ent Publication (Kokai) No. 3-200300). However, al- 
though this method offers a large degree of freedom with 
respect to the quality and fluctuations of vibrato and 
pitch of elongated sounds, the clarity of synthesized
sounds (especially consonants) is poor, and therefore 
quality is not always satisfactory. 
[0008] U.S. Patent No. 5029509 discloses a technique
known as Spectral Modeling Synthesis (SMS) for
analyzing and synthesizing a musical sound using a
model that expresses an original sound as comprised
of two components, namely a deterministic component
and a stochastic component.

[0009] With SMS analysis and synthesis, good control 
of the musical characteristics of a musical sound is pos- 
sible, and at the same time, in the case of a singing 
voice, through use of the stochastic component, a high
degree of clarity can be expected from even the consonants.
Therefore, applying this technique to the synthesis
of a singing voice is expected to achieve a synthesized
sound having a high degree of clarity and musicality.
In fact, Japanese Patent No. 2906970 proposes
specific applications for sound synthesis based on SMS
analysis and synthesis techniques, and at the same
time, also describes a methodology for utilizing SMS
techniques in singing voice synthesis (singing synthesizer).

[0010] An application of the techniques proposed in
the aforementioned Japanese Patent No. 2906970 to a
singing voice synthesizing apparatus will be described
with reference to FIG. 17.

[0011] In FIG. 17, input voices are SMS-analyzed and
segmented into individual voice fragments (phonemes
or phoneme chains) by an SMS-analyzer/segmentor
103, which are stored to generate a phoneme database
100. The database 100, comprising voice fragment data
(phoneme data 101 and phoneme chain data 102) for a
single frame or plurality of frame strings arranged in a
time series, stores SMS data for each frame, namely
changes over time of the spectral envelope of the
deterministic component, the spectral envelope and phase
spectrum of the stochastic component, etc.

[0012] When synthesizing a singing voice sound, a
phoneme string comprising the desired lyrics is obtained,
a phoneme-to-fragment converter 104 determines
the required voice fragments (phonemes or phoneme
chains) that comprise the phoneme string, and
then SMS data (deterministic component and stochastic
component) of the required voice fragments is read from
the aforementioned database 100. Next, a fragment
concatenator 105 concatenates the read-out SMS data
of the voice fragments into a time series. For the
deterministic component, based on pitch information
corresponding to a melody of the song, a deterministic
component generator 106 generates harmonic components
having the desired pitch while preserving the shape of
the spectral envelope of the deterministic component.
For example, to synthesize the Japanese word "saita",
the fragments of "#s", "s", "s-a", "a", "a-i", "i", "i-t", "t",
"t-a", "a", and "a#" are concatenated, and the deterministic
component of the desired pitch is generated while
preserving the shape of the spectral envelope included in
the SMS data obtained from the fragment concatenation.
Next, the generated deterministic component and
the stochastic component are added together by a
synthesizing means 107, and the result thereof is
transformed into time domain data to obtain synthesized
voice.
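
The conversion from a phoneme string to a fragment string can be sketched as follows. This is only an illustrative sketch: the function name `phonemes_to_fragments`, its parameters, and the use of "#" as a silence marker attached to the first and last phoneme are assumptions chosen to reproduce the "saita" sequence given above, not a description of the converter 104 itself.

```python
def phonemes_to_fragments(phonemes, silence="#"):
    """Expand a list of phonemes into single-phoneme and phoneme-chain fragments.

    Hypothetical helper: each phoneme is emitted on its own, and each pair of
    adjacent phonemes is emitted as a chain "x-y". If `silence` is not None,
    boundary fragments such as "#s" and "a#" are added at both ends.
    """
    if not phonemes:
        return []
    fragments = []
    if silence is not None:
        fragments.append(silence + phonemes[0])      # silence-to-phoneme
    for i, phoneme in enumerate(phonemes):
        if i > 0:
            fragments.append(f"{phonemes[i - 1]}-{phoneme}")  # phoneme chain
        fragments.append(phoneme)                     # single phoneme
    if silence is not None:
        fragments.append(phonemes[-1] + silence)     # phoneme-to-silence
    return fragments


fragments = phonemes_to_fragments(list("saita"))
inner_only = phonemes_to_fragments(list("saita"), silence=None)
```

With the silence marker enabled, `fragments` begins with "#s" and ends with "a#"; with `silence=None` only the phonemes and the chains between them are produced.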

[0013] By thus utilizing these SMS techniques, natu- 
ral sounding synthesized singing with good comprehen- 
sibility can be obtained even for elongated sounds. 
[0014] However, the method described in the afore- 
mentioned Japanese Patent No. 2906970 is overly ru- 
dimentary and simplistic, and the following types of 
problems will occur if a singing voice is synthesized ac- 
cording to that method. 

Because the spectral envelope shape of the deterministic
component of a voiced sound changes
somewhat depending on pitch, synthesis at a pitch
different from the pitch used at the time of analysis
cannot, by itself, achieve good tone color.

When performing SMS analysis in the case of a
voiced sound, even if the deterministic component
is removed, a small fraction of the deterministic
component remains in the residual component.
Therefore, using the same residual component
(stochastic component) directly to synthesize a
singing sound at a pitch different from the original
sound as noted above causes the residual component
to become noticeably audible or to sound like noise.

Because the SMS analysis results of phoneme data
and phoneme chain data are superposed temporally
as they are, the duration of an elongated sound
and transitional time between phonemes cannot be
adjusted. In other words, it is not possible to sing at
a desired tempo.

Noise is apt to be generated when concatenating
the phonemes or phoneme chains.

SUMMARY OF THE INVENTION 

[0015] It is a first object of the present invention to pro-
vide a singing voice synthesizing apparatus and a sing- 
ing voice synthesizing method that resolve the above 
described problems through prescribing a specific 
method for utilizing the SMS techniques proposed in the 
aforementioned Japanese Patent No. 2906970 and 
adding considerable improvements for enhancing the 
synthesized sound quality, to thereby enable achieve- 
ment of a natural sounding synthesized singing voice 
with a good level of comprehensibility, and a program 
for realizing a singing voice synthesizing method. 
[0016] It is a second object of the present invention to
provide a singing voice synthesizing apparatus and a 
singing voice synthesizing method that are capable of 
reducing the size of the aforementioned database and 
increasing the efficiency with which the database is generated,
and a program for realizing a singing voice synthesizing
method.

[0017] It is a third object of the present invention to
provide a singing voice synthesizing apparatus
and a singing voice synthesizing method that are capable
of adjusting the degree of huskiness in a synthesized
voice, and a program for realizing a singing voice
synthesizing method.

[0018] To attain the objects, the present invention provides
a singing voice synthesizing apparatus comprising
a phoneme database that stores a plurality of voice
fragment data formed of voice fragments each being a
single phoneme or a phoneme chain of at least two
concatenated phonemes, each of the plurality of voice
fragment data comprising data of a deterministic component
and data of a stochastic component, an input device that
inputs lyrics, a readout device that reads out from the
phoneme database the voice fragment data corresponding
to the inputted lyrics, a duration time adjusting
device that adjusts time duration of the read-out voice
fragment data so as to match a desired tempo and manner
of singing, an adjusting device that adjusts the
deterministic component and the stochastic component of
the read-out voice fragment so as to match a desired
pitch, and a synthesizing device that synthesizes a singing
sound by sequentially concatenating the voice fragment
data that have been adjusted by the duration time
adjusting device and the adjusting device.
[0019] With the above arrangement according to the 
present invention, through improvement of the SMS
techniques, a natural sounding synthesized singing
voice with a good level of comprehensibility can be ob- 
tained even for elongated sounds, and further, even 
slight variations of vibrato and pitch do not result in an 
unnatural sounding synthesized sound. 

[0020] Preferably, the phoneme database stores a
plurality of voice fragment data having different musical 
expressions for a single phoneme or phoneme chain. 
[0021] More preferably, the musical expressions include
at least one parameter selected from the group
consisting of pitch, dynamics and tempo.

[0022] In a preferred embodiment of the present invention,
the phoneme database stores voice fragment
data comprising elongated sounds that are each enunciated
by elongating a single phoneme, voice fragment
data comprising consonant-to-vowel phoneme chains
and vowel-to-consonant phoneme chains, voice fragment
data comprising consonant-to-consonant phoneme
chains, and voice fragment data comprising vowel-to-vowel
phoneme chains.

[0023] In a preferred form of the present invention,
each of the voice fragment data comprises a plurality of
data corresponding respectively to a plurality of frames
of a frame string formed by segmenting a corresponding
one of the voice fragments, and wherein the data of the
deterministic component and the data of the stochastic
component of each of the voice fragment data each
comprise a series of frequency domain data corresponding
respectively to the plurality of frames of the
frame string corresponding to each of the voice fragments.

[0024] Moreover, in this preferred form, the duration
time adjusting device generates a frame string of a desired
time length by repeating at least one frame of the
plurality of frames of the frame string corresponding to
each of the voice fragments, or by thinning out a predetermined
number of frames of the plurality of frames of
the frame string corresponding to each of the voice fragments.

[0025] With this arrangement, since the length of an 
elongated phoneme and length of a phoneme chain can 
be adjusted freely, a synthesized singing voice can be 
obtained at a desired tempo. 
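
The adjustment of a frame string to a desired length by repeating or thinning out frames can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `fit_frame_count` and the uniform index mapping are hypothetical, since the text does not specify which frames are repeated or dropped.

```python
def fit_frame_count(frames, target_len):
    """Return a frame string containing exactly target_len frames.

    Frames are repeated when target_len exceeds len(frames) and thinned
    out when it is smaller. The uniform index mapping used here is an
    assumption made for illustration only.
    """
    if not frames:
        raise ValueError("frame string must not be empty")
    if target_len <= 0:
        return []
    n = len(frames)
    # Map every output position back to a source frame index.
    return [frames[(i * n) // target_len] for i in range(target_len)]


frames = ["f0", "f1", "f2", "f3"]
stretched = fit_frame_count(frames, 8)  # slower tempo: frames repeated
thinned = fit_frame_count(frames, 2)    # faster tempo: frames thinned out
```

Here `stretched` uses every source frame twice, while `thinned` keeps every other frame.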

[0026] More preferably, the duration time adjusting
device generates the frame string of a desired time
length by repeating a plurality of frames of the frame
string corresponding to each of the voice fragments, the
duration time adjusting device repeating the plurality of
frames in a first direction in which the frame string of a
desired time length is generated and in a second direction
opposite thereto.

[0027] Still more preferably, when repeating the plu- 
rality of frames of the frame string corresponding to the 
data of the stochastic component of each of the voice
fragments in the first and second directions, the duration 
time adjusting device reverses a phase of a phase spec- 
trum of the stochastic component. 
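
The repetition in the first and second directions can be sketched as a back-and-forth loop over the frame string. The following is a hedged illustration only: the name `loop_stochastic`, the representation of a frame as an `(amplitudes, phases)` pair, and the reading of "reverses a phase" as negating every phase value are assumptions. Negating the phase is equivalent to conjugating the spectrum, which for a real signal corresponds to time reversal of the frame.

```python
def loop_stochastic(frames, target_len):
    """Extend a stochastic-component frame string by looping back and forth.

    Each frame is a hypothetical (amplitudes, phases) pair. Frames read in
    the reverse direction have their phase values negated (an assumed
    interpretation of the phase reversal described in the text).
    """
    if not frames:
        raise ValueError("frame string must not be empty")
    out = []
    idx, step = 0, 1
    for _ in range(target_len):
        amps, phases = frames[idx]
        if step < 0:
            phases = [-p for p in phases]  # reverse-direction frame
        out.append((list(amps), list(phases)))
        if len(frames) > 1:
            # Turn around when the next step would leave the frame string.
            if idx + step >= len(frames) or idx + step < 0:
                step = -step
            idx += step
    return out


source = [([1.0], [0.5]), ([1.0], [1.0]), ([1.0], [1.5])]
looped = loop_stochastic(source, 7)
looped_phases = [phases[0] for _, phases in looped]
```

For this three-frame input, the loop runs forward, back, and forward again, and only the frames read in the backward direction carry negated phases.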
[0028] Preferably, the singing voice synthesizing apparatus
according to the present invention further comprises
a fragment level adjusting device that performs
smoothing processing or level adjusting processing on
the deterministic component and the stochastic component
contained in each of the voice fragment data when
the voice fragment data are sequentially concatenated
by the synthesizing device.

[0029] With this arrangement, since a smoothing or
level adjusting process is performed at the concatenation
boundary between phonemes, noise is not generated
when the phonemes are concatenated.
[0030] Also preferably, the singing voice synthesizing 
apparatus according to the present invention further 
comprises a deterministic component generating device 
that changes only pitch of the deterministic component 
to a desired pitch while preserving the spectral envelope
shape of the deterministic component contained in each 
of the voice fragment data when the voice fragment data 
are sequentially concatenated by the synthesizing de- 
vice. 
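
Changing only the pitch while preserving the spectral envelope shape can be pictured as resampling the envelope at the harmonics of the desired pitch. The sketch below is an illustration under assumptions: the envelope is represented as sorted `(frequency_hz, amplitude)` breakpoints with linear interpolation, and the names `envelope_at` and `harmonics_for_pitch` are hypothetical.

```python
from bisect import bisect_right


def envelope_at(envelope, freq):
    """Linearly interpolate an envelope given as sorted (Hz, amplitude) points."""
    freqs = [f for f, _ in envelope]
    if freq <= freqs[0]:
        return envelope[0][1]
    if freq >= freqs[-1]:
        return envelope[-1][1]
    hi = bisect_right(freqs, freq)
    (f_lo, a_lo), (f_hi, a_hi) = envelope[hi - 1], envelope[hi]
    return a_lo + (a_hi - a_lo) * (freq - f_lo) / (f_hi - f_lo)


def harmonics_for_pitch(envelope, pitch_hz, max_hz):
    """Place harmonics at multiples of pitch_hz, with amplitudes read off the envelope."""
    harmonics = []
    k = 1
    while k * pitch_hz <= max_hz:
        freq = k * pitch_hz
        harmonics.append((freq, envelope_at(envelope, freq)))
        k += 1
    return harmonics


envelope = [(0.0, 1.0), (1000.0, 0.5), (2000.0, 0.0)]
low = harmonics_for_pitch(envelope, 200.0, 1000.0)
high = harmonics_for_pitch(envelope, 500.0, 1000.0)
```

Both `low` and `high` follow the same envelope; only the spacing of the harmonics differs with the pitch.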

[0031] Preferably, the phoneme database stores
voice fragment data comprising elongated sounds that
are each enunciated by elongating a single phoneme,
the phoneme database further storing a flat spectrum
as an amplitude spectrum of the stochastic component
of each of the voice fragment data comprising each of
the elongated sounds, obtained by multiplying the amplitude
spectrum thereof by an inverse of a typical spectrum
within an interval of the elongated sound.
[0032] In this case, the amplitude spectrum of the stochastic
component of each of the voice fragment data
comprising each of the elongated sounds is obtained by
multiplying an amplitude spectrum of the stochastic
component calculated based on an amplitude spectrum
of the deterministic component of the voice fragment data
of the elongated sound, by the flat spectrum.
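
The two multiplications described in [0031] and [0032] can be sketched as follows. This is an illustrative sketch only: taking the "typical spectrum" as the per-bin mean over the frames of the elongated sound is an assumption, the names `typical_spectrum`, `flatten` and `reshape` are hypothetical, and the spectrum calculated from the deterministic component is simply passed in as `shaping` because its derivation is not specified in this passage.

```python
def typical_spectrum(frames):
    """Per-bin mean amplitude over all frames (assumed meaning of 'typical')."""
    n = len(frames)
    return [sum(frame[k] for frame in frames) / n for k in range(len(frames[0]))]


def flatten(frames, typical, eps=1e-12):
    """Multiply each amplitude spectrum by the inverse of the typical spectrum."""
    return [[a / max(t, eps) for a, t in zip(frame, typical)] for frame in frames]


def reshape(flat_frame, shaping):
    """At synthesis, multiply the stored flat spectrum by a shaping spectrum."""
    return [f * s for f, s in zip(flat_frame, shaping)]


stochastic = [[2.0, 4.0], [4.0, 8.0]]   # amplitude spectra of two frames
typical = typical_spectrum(stochastic)   # per-bin mean of the two frames
flat = flatten(stochastic, typical)      # stored in the database
restored = reshape(flat[0], typical)     # shaping spectrum chosen equal to typical
```

When the shaping spectrum equals the typical spectrum, the original amplitudes are recovered; a different shaping spectrum would give the stored flat spectrum a different overall shape.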
[0033] Preferably the phoneme database does not 
store amplitude spectra of stochastic components of 
voice fragment data comprising certain elongated 
sounds, and the flat spectrum stored as an amplitude 
spectrum of voice fragment data comprising at least one 
other elongated sound is used for synthesis of the cer- 
tain sounds. 

[0034] Preferably the amplitude spectrum of the sto- 
chastic component calculated based on the amplitude 
spectrum of the deterministic component has a gain 
thereof at 0Hz controlled according to a parameter for 
controlling a degree of huskiness. 
[0035] With this arrangement, the degree of huski- 
ness of a synthesized voice can be controlled simply.
[0036] To attain the above objects, the present inven- 
tion also provides a singing voice synthesizing method 
comprising the steps of storing in a phoneme database 
a plurality of voice fragment data formed of voice frag- 
ments each being a single phoneme or a phoneme 
chain of at least two concatenated phonemes, each of 
the plurality of voice fragment data comprising data of 
a deterministic component and data of a stochastic com- 
ponent, reading out from the phoneme database the 
voice fragment data corresponding to lyrics inputted by 
an input device, adjusting time duration of the read-out 
voice fragment data so as to match a desired tempo and 
manner of singing, adjusting the deterministic compo- 
nent and the stochastic component of the read-out voice 
fragment so as to match a desired pitch, and synthesiz- 
ing a singing sound by sequentially concatenating the 
voice fragment data that have been adjusted in respect 
of the time duration and the deterministic component 
and the stochastic component thereof. 
[0037] To attain the above objects, the present inven- 
tion further provides a program for causing a computer
to execute the above mentioned singing voice synthe- 
sizing method. 

[0038] To attain the above objects, the present inven- 
tion further provides a mechanically readable storage 
medium storing instructions for causing a machine to ex-
ecute the above mentioned singing voice synthesizing 
method. 

[0039] According to the present invention, the synthe- 
sized singing voice can be of high quality, having an ap- 
propriate tone color for a desired pitch, and is free of 
noise between concatenated units. Further, the data- 
base can be made extremely small in size and can be 
generated with a higher efficiency. Still further, the de- 
gree of huskiness of a synthesized voice can be con- 
trolled simply. 

[0040] The above and other objects, features, and ad- 
vantages of the invention will become more apparent 
from the following detailed description taken in conjunction
with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS
[0041] 

FIG. 1 is a diagram illustrating a process for generating
a phoneme database used in a singing voice
synthesizing apparatus of the present invention;
FIGS. 2A and 2B together form a diagram illustrating a process
for synthesizing a singing voice carried out by the
singing voice synthesizing apparatus of the present
invention;

FIGS. 3A and 3B are diagrams illustrating a process
for adjusting a stochastic component carried out by
the singing voice synthesizing apparatus of the
present invention, in which:

FIG. 3A shows an example of an amplitude spectrum
of a stochastic component obtained by
SMS analysis of a voiced sound; and
FIG. 3B shows the result of performing a stochastic
component adjusting process on the
amplitude spectrum of the stochastic component
of FIG. 3A;

FIGS. 4A to 4C are diagrams illustrating a looping
process carried out by the singing voice synthesizing
apparatus of the present invention, in which:

FIG. 4A shows an example of a stochastic component
waveform that will be subjected to loop
processing;

FIG. 4B shows the result of loop processing the
waveform of FIG. 4A, where frames are read out
in a reverse direction, with the phase unchanged; and
FIG. 4C shows the result of loop processing the
waveform of FIG. 4A, where frames are read out
in a reverse direction, with the phase reversed;

FIG. 5 is a diagram illustrating the modeling of a
spectral envelope;

FIG. 6 is a diagram useful in explaining a mismatch
at a fragment data concatenation boundary;
FIG. 7 is a diagram illustrating a smoothing process
in the singing voice synthesizing apparatus of the
present invention;

FIGS. 8A through 8C are diagrams illustrating a level
adjusting process carried out by the singing voice
synthesizing apparatus of the present invention, in
which:

FIG. 8A is a diagram illustrating a level adjusting
process for fragment "a-i" at the time when
the fragments of "a-i" and "i-a" are to be concatenated;

FIG. 8B is a diagram illustrating a level adjusting
process for fragment "i-a"; and

FIG. 8C is a diagram showing a result of concatenating
the level adjusted fragments of "a-i"
and "i-a";

FIGS. 9A and 9B together form a function block diagram
illustrating a detailed configuration of a singing voice
synthesizing apparatus according to an embodiment
of the present invention;
FIG. 10 is a diagram illustrating an example of the
construction of a hardware apparatus used to operate
a singing voice synthesizing apparatus of the
present invention;

FIG. 11 is a diagram illustrating an example of spec- 
tral envelopes of deterministic and stochastic com- 
ponents of an elongated sound; 
FIG. 12 is a diagram illustrating a process for gen- 
erating a phoneme database carried out by a sing- 
ing voice synthesizing apparatus according to an- 
other embodiment of the present invention; 
FIG. 13 is a diagram illustrating an example of the 
configuration of a spectral whitening means; 
FIGS. 14A and 14B together form a diagram illustrating a
singing voice synthesis process carried out by the singing
voice synthesizing apparatus according to the
other embodiment of the present invention;
FIG. 15 is a diagram useful in explaining the control
of huskiness;

FIG. 16 is a diagram illustrating an example of the 
configuration of a spectral envelope generating 
means that is adapted to control huskiness; and 
FIG. 17 is a diagram illustrating the construction of 
a singing voice synthesizing apparatus that em- 
ploys the conventional SMS method. 

DETAILED DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

[0042] The singing voice synthesizing apparatus of 
the present invention has a phoneme database which 
is comprised of individual phonemes and phoneme 
chains that have been obtained by dividing into required 
segments SMS data of deterministic and stochastic 
components obtained from an SMS analysis of input 
voices. This database also contains heading informa- 
tion including information indicative of the phonemes 
and phoneme chains, information indicative of the pitch 
of voice fragments formed of the phonemes and pho- 
neme chains, and information indicative of musical ex- 
pressions such as dynamics and tempo thereof. Here, 
the dynamics information may be either sensory infor- 
mation indicative of whether the voice fragment (pho- 
neme or phoneme chain) is a forte or mezzo forte sound, 
or physical information indicating the level of the frag- 
ment. 

[0043] Moreover, an SMS analysis means is provided
for decomposing the input singing voice into determin-
istic and stochastic components, and analyzing them in
order to generate the aforementioned database. Also,





EP1 220 195 A2 






a means (which may be either automatic or manual) for 
segmenting the SMS data into the required phonemes 
or phoneme chains (fragments) is provided. 
[0044] An example of generating the phoneme database
will be described with reference to FIG. 1.
[0045] In FIG. 1, reference numeral 10 designates the
phoneme database in which are stored SMS data in the
form of voice fragments (SMS data of one or more
frames determined by the respective voice fragments)
obtained by subjecting input singing voices to an SMS
analysis and segmenting the resulting SMS data into
phonemes and phoneme chains (voice fragments) by a
segmentor 14 in a manner similar to the aforementioned
phoneme database 100. In the phoneme database 10,
the fragment data are stored in the form of separate data
for each different pitch, and for each different dynamics
and tempo.

[0046] In the case of synthesizing Japanese language
lyrics, the voice fragments are comprised of, for example,
vowel sound data (one or a plurality of frames),
consonant-to-vowel sound data (a plurality of frames),
vowel-to-consonant sound data (a plurality of frames), and
vowel-to-vowel data (a plurality of frames).
[0047] A voice synthesis apparatus that uses voice
synthesis by rule or the like normally stores data in its
phoneme database in units that are longer than one syllable,
such as VCV (vowel-consonant-vowel) or CVC
(consonant-vowel-consonant) units. On the other hand,
in the singing voice synthesizing apparatus of the
present invention, which aims to synthesize a singing
voice sound, data of elongated sounds, which frequently
occur in singing as the enunciation of long vowels,
consonant-to-vowel (CV) and vowel-to-consonant (VC) sound
data, consonant-to-consonant sound data, and vowel-to-vowel
sound data are stored in the phoneme database.

[0048] The SMS analyzer 13 performs an SMS anal- 
ysis of original input singing voices and outputs SMS- 
analyzed data for each frame. 

[0049] More specifically, the input voice is divided into 
a series of time frames, and an FFT or other frequency 
analysis is performed for each frame. From the resulting 
frequency spectra (complex spectra), amplitude spectra 
and phase spectra are obtained, and a specific frequen- 
cy spectrum that corresponds to a peak in the amplitude 
spectrum is extracted as a line spectrum. In this case, 
a spectrum containing the fundamental frequency and 
frequencies in the vicinity of its integer multiples is a line
spectrum. This extracted line spectrum corresponds to
the deterministic component.
[0050] Next, a residual spectrum is obtained by subtracting
the line spectrum, which has been extracted as
described above, from the spectrum of the input waveform
of the frame. Alternatively, temporal waveform data
of the deterministic component, which has been synthesized
from the extracted line spectrum, is subtracted
from the input waveform data of that frame to obtain temporal
waveform data of the residual component, and
then a frequency analysis of the residual component
temporal waveform data is performed to obtain the residual
spectrum. The thus-obtained residual spectrum
corresponds to the stochastic component.
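
The per-frame decomposition described in [0049] and [0050] can be sketched for a single frame as follows. This is a simplified sketch under stated assumptions and not the SMS analyzer 13 itself: the fundamental frequency `f0` is assumed to be known in advance, a plain DFT stands in for the FFT, one peak bin is taken per harmonic within `tolerance_bins` of its expected position, and the names `dft` and `split_frame` are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive DFT returning the non-negative-frequency bins of a real frame."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n // 2 + 1)]


def split_frame(frame, sample_rate, f0, tolerance_bins=1):
    """Split a frame's spectrum into a line spectrum and a residual spectrum."""
    spectrum = dft(frame)
    bin_hz = sample_rate / len(frame)
    line = [0j] * len(spectrum)
    harmonic = 1
    while harmonic * f0 < sample_rate / 2:
        center = round(harmonic * f0 / bin_hz)
        lo = max(center - tolerance_bins, 0)
        hi = min(center + tolerance_bins, len(spectrum) - 1)
        # Take the amplitude peak in the vicinity of the integer multiple of f0.
        peak = max(range(lo, hi + 1), key=lambda k: abs(spectrum[k]))
        line[peak] = spectrum[peak]
        harmonic += 1
    # Residual spectrum = input spectrum minus the extracted line spectrum.
    residual = [s - l for s, l in zip(spectrum, line)]
    return line, residual


rate, size = 800, 64
frame = [math.sin(2 * math.pi * 100 * t / rate)           # fundamental, 100 Hz
         + 0.25 * math.sin(2 * math.pi * 200 * t / rate)  # second harmonic
         + 0.1 * math.sin(2 * math.pi * 150 * t / rate)   # non-harmonic part
         for t in range(size)]
line, residual = split_frame(frame, rate, f0=100.0)
```

In this toy frame the 100 Hz and 200 Hz partials end up in `line`, while the 150 Hz component, which lies between the harmonics, remains in `residual`.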
[0051] The frame period used in the above SMS analysis
may have either a certain fixed length, or a variable
length that changes according to the pitch or other parameter
of the input voice. If the frame period has a variable
length, the input voice is processed with a first
frame period of fixed length, the pitch is detected, and
then the input voice is reprocessed with a frame period
of a length that corresponds to the results of the pitch
detection; alternatively, a method may be employed in
which the period of the following frame is varied according
to the pitch detected from the present frame.
[0052] The SMS-analyzed data output for each frame 
from the SMS analyzer 13 is segmented into the length 
of a voice fragment stored in the phoneme database by 
the segmentor 14. More specifically, the SMS-analyzed
data is manually or automatically segmented to extract 
vowel phonemes, vowel-consonant or consonant-vowel 
phoneme chains, consonant-consonant phoneme 
chains, and vowel-vowel phoneme chains so as to be 
optimally suited for singing sound synthesis. Here, long 
interval data of vowels that are to be elongated and sung 
(elongated sounds) are also extracted by segmentation 
as vowel phonemes. 

[0053] Moreover, the segmentor 14 detects the pitch 
of the input voice based on the aforementioned SMS 
analysis results. The pitch detection is performed by first 
calculating an average pitch value from the frequency 
of lower-order line spectra in the deterministic compo- 
nent of a frame included in the fragment, and then cal- 
culating an average pitch value for all frames. 
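
The two-stage averaging used for pitch detection can be sketched as follows. The sketch makes assumptions that are not stated in the text: the line-spectrum frequencies of each frame are given in harmonic order, the k-th frequency is divided by k to yield a pitch estimate, and only the first `max_order` harmonics are used. The names `frame_pitch` and `fragment_pitch` are hypothetical.

```python
def frame_pitch(line_freqs, max_order=3):
    """Average pitch of one frame from its lower-order line spectra.

    line_freqs[k - 1] is assumed to be the frequency (Hz) of the k-th harmonic.
    """
    used = line_freqs[:max_order]
    return sum(freq / k for k, freq in enumerate(used, start=1)) / len(used)


def fragment_pitch(frames_line_freqs, max_order=3):
    """Average the per-frame pitch values over all frames of a fragment."""
    pitches = [frame_pitch(freqs, max_order) for freqs in frames_line_freqs if freqs]
    if not pitches:
        raise ValueError("no line spectra available for pitch detection")
    return sum(pitches) / len(pitches)


frames_line_freqs = [
    [130.0, 260.0, 390.0],  # frame 1: harmonics of 130 Hz
    [134.0, 268.0, 402.0],  # frame 2: harmonics of 134 Hz
]
pitch = fragment_pitch(frames_line_freqs)
```

For these two frames the per-frame estimates are 130 Hz and 134 Hz, so the fragment pitch is their mean.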
[0054] In this manner, data of the deterministic com- 
ponent and data of the stochastic component are ex- 
tracted for each fragment and stored in the phoneme 
database 10, with headings comprised of information of 
the pitch of the input singing voice and musical expres- 
sions of tempo, dynamics, etc. appended thereto. 
[0055] FIG. 1 shows one example of the phoneme da- 
tabase 10 that has been created in this manner. The 
phoneme database 10 is comprised of a phoneme data
area 11 for phonemes, and a phoneme chain data area
12 for phoneme chains. The phoneme data area 11 contains
four types of phoneme data of elongated vowel "a"
at four pitch frequencies of 130 Hz, 150 Hz, 200 Hz and
220 Hz, and three types of phoneme data of elongated
vowel "i" at three pitch frequencies of 140 Hz, 180 Hz and
300 Hz. Moreover, the phoneme chain data area 12 contains
two types of phoneme chain data of phoneme
chain "a-i", indicating the concatenation of phonemes
"a" and "i", at two pitch frequencies of 130 Hz and 150
Hz, two types of phoneme chain "a-p" at two frequencies
of 120 Hz and 220 Hz, two types of phoneme chain
"a-s" at frequencies of 140 Hz and 180 Hz, and one type
of phoneme chain "a-z" at a frequency of 100 Hz. Here,
for the same phoneme or phoneme chain, data of different
pitches are stored; however, as described above, data
of different musical expressions of the input singing
voice, such as dynamics and tempo, are also stored as 
separate data. 
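
The contents of the example database of FIG. 1 can be written down as a small data structure. The sketch below is illustrative only: the class `FragmentEntry`, its default dynamics and tempo values, and the lookup `closest_pitch_entry`, which picks the stored entry nearest to a requested pitch, are assumptions introduced here and are not taken from the description of the database 10.

```python
from dataclasses import dataclass, field


@dataclass
class FragmentEntry:
    fragment: str              # phoneme ("a") or phoneme chain ("a-i")
    pitch_hz: float            # pitch stored in the heading information
    dynamics: str = "mf"       # hypothetical placeholder value
    tempo_bpm: float = 120.0   # hypothetical placeholder value
    frames: list = field(default_factory=list)  # per-frame SMS data


# Pitches listed for FIG. 1: phoneme data area and phoneme chain data area.
FIG1_PITCHES = {
    "a": [130, 150, 200, 220],
    "i": [140, 180, 300],
    "a-i": [130, 150],
    "a-p": [120, 220],
    "a-s": [140, 180],
    "a-z": [100],
}


def build_database(pitch_table):
    """Create one entry per (fragment, pitch) pair."""
    return {name: [FragmentEntry(name, float(p)) for p in pitches]
            for name, pitches in pitch_table.items()}


def closest_pitch_entry(database, fragment, target_hz):
    """Assumed lookup rule: return the entry whose pitch is nearest to target_hz."""
    return min(database[fragment], key=lambda e: abs(e.pitch_hz - target_hz))


database = build_database(FIG1_PITCHES)
entry = closest_pitch_entry(database, "a", 190.0)
```

With the pitches of FIG. 1, a request for "a" at 190 Hz would return the entry recorded at 200 Hz under this assumed rule.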

[0056] Of data of deterministic and stochastic compo- 
nents contained in the data of each fragment, namely, 
SMS data from the aforementioned SMS analyzer
13 that has been segmented into individual fragments 
by the segmentor 14, the data of deterministic compo- 
nents may be stored either by storing all spectral enve- 
lopes (line spectra (harmonic series) strength (ampli- 
tude) and phase spectra) of each frame contained in 
each fragment as they are, or by storing arbitrary func- 
tions that express the spectral envelopes instead of 
spectral envelopes. The data of deterministic compo- 
nents may also be stored in the form of inverse-trans- 
formed temporal waveforms. Furthermore, the data of 
stochastic components may be stored in the form of 
strength spectra (amplitude spectra) and phase spectra 
for each frame of the segment corresponding to each 
fragment, or in the form of temporal waveform data of 
each segment. Moreover, the above-noted storage for- 
mats are not limitative, but may be varied for each frag- 
ment, or according to vocal properties (such as nasal, 
fricative or plosive sounds) of each segment. In the de- 
scription that follows, the deterministic component data 
are stored in the format of spectral envelopes, and the 
stochastic component data are stored in the format of 
amplitude spectra and phase spectra. With these types 
of storage format, the required storage capacity can be 
reduced.

[0057] In this manner, in the singing voice synthesizing
apparatus of the present invention, the phoneme database
10 stores a plurality of data corresponding to different
pitches, dynamics, tempos, and other musical expressions
for each of the same phoneme and the same phoneme chain.

[0058] Next, the process of synthesizing singing sounds
using the phoneme database 10 created as described above
will be described with reference to FIGS. 2A and 2B.

[0059] In FIGS. 2A and 2B, reference numeral 10 designates
the phoneme database. Reference numeral 21 designates a
phoneme-to-fragment conversion means that converts a
phoneme string corresponding to the lyric data of a song
for which a singing sound is to be synthesized, into
fragments for searching the phoneme database 10. For
example, if a phoneme string of "s_a_i_t_a" is input, then
a fragment string of "s", "s-a", "a", "a-i", "i", "i-t",
"t", "t-a", and "a" is output.
[0060] Reference numeral 22 designates a deterministic
component adjusting means that, based on control
parameters such as pitch, dynamics and tempo that are
included in the melody data of the song, adjusts the data
of the deterministic component of fragment data read from
the phoneme database 10, and reference numeral 23
designates a stochastic component adjusting means that
adjusts the data of the stochastic component.






[0061] Reference numeral 24 designates a duration time
adjusting means that varies the duration time of fragment
data output from the deterministic component adjusting
means 22 and from the stochastic component adjusting
means 23. Reference numeral 25 designates a fragment
level adjusting means that adjusts the level of each
fragment data output from the duration time adjusting
means 24. Reference numeral 26 designates a fragment
concatenating means that concatenates individual fragment
data, which have been level-adjusted by the fragment
level adjusting means 25, into a time series. Reference
numeral 27 designates a deterministic component
generating means that, based on the deterministic
components of fragment data that have been concatenated
by the fragment concatenating means 26, generates
deterministic components (harmonic components) having a
desired pitch. Reference numeral 28 designates an adding
means that synthesizes harmonic components generated by
the deterministic component generating means 27 and
stochastic components output from the fragment
concatenating means 26. Voice synthesis can be achieved
by transforming the output from this adding means 28 into
a time domain signal.
[0062] The processing of each of the above-mentioned
blocks will be described below.

[0063] The phoneme-to-fragment conversion means 21
generates a fragment string from a phoneme string that
has been converted based on the input lyrics, and
thereupon selectively reads out voice fragments (phonemes
or phoneme chains) from the phoneme database 10. As
described previously, even for a single phoneme or
phoneme chain, a plurality of data (voice fragment data)
are stored in the database corresponding respectively to
the pitch, dynamics, tempo, etc. When selecting a
fragment, the most suitable one is chosen according to
the various control parameters.
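
The conversion and selection steps described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the dictionary layout of a database entry, the function names, and the cost used to rank candidates (octave distance plus a weighted relative tempo difference) are all assumptions made for the example.

```python
import math


def phonemes_to_fragments(phonemes):
    """Interleave each phoneme with the chain linking it to the next one."""
    fragments = []
    for i, phoneme in enumerate(phonemes):
        fragments.append(phoneme)
        if i + 1 < len(phonemes):
            fragments.append(f"{phoneme}-{phonemes[i + 1]}")
    return fragments


def select_fragment(entries, name, pitch_hz, tempo_bpm, tempo_weight=0.5):
    """Return the stored entry for `name` closest to the requested controls.

    The cost function is hypothetical; the text only says that the most
    suitable fragment is chosen according to the control parameters.
    """
    candidates = [e for e in entries if e["name"] == name]
    if not candidates:
        raise KeyError(f"no fragment stored for {name!r}")

    def cost(entry):
        pitch_term = abs(math.log2(entry["pitch_hz"] / pitch_hz))
        tempo_term = abs(entry["tempo_bpm"] - tempo_bpm) / tempo_bpm
        return pitch_term + tempo_weight * tempo_term

    return min(candidates, key=cost)


# Hypothetical database contents, for illustration only.
database = [
    {"name": "a", "pitch_hz": 130.0, "tempo_bpm": 120.0},
    {"name": "a", "pitch_hz": 150.0, "tempo_bpm": 120.0},
    {"name": "a-i", "pitch_hz": 140.0, "tempo_bpm": 120.0},
]

fragments = phonemes_to_fragments("s_a_i_t_a".split("_"))
best = select_fragment(database, "a", pitch_hz=145.0, tempo_bpm=120.0)
```

Measuring pitch distance on a logarithmic scale is one plausible choice because musical intervals are ratios; any other ranking could be substituted.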
[0064] Moreover, instead of selecting a single fragment,
it may be so arranged that several candidates are
selected and interpolated to obtain the SMS data to be
used for synthesis. The selected voice fragments contain
deterministic components and stochastic components which
are results of the SMS analysis. These deterministic and
stochastic components contain SMS data, namely, the
spectral envelopes (strength and phase) of the
deterministic components, the spectral envelopes
(strength and phase) of the stochastic component, and
waveforms themselves. Based on these contents,
deterministic components and stochastic components are
generated so as to match a desired pitch and required
duration time. For example, the shapes of spectral
envelopes of deterministic and stochastic components are
obtained by interpolation or other means and may be
varied so as to match the desired pitch.
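
As one possible reading of the interpolation mentioned above, the sketch below blends two candidate spectral envelopes linearly according to where the desired pitch lies between the pitches of the two candidates. The linear weighting, the clamping, and the assumption that both envelopes are sampled on the same frequency grid are illustrative choices, not details given in the text.

```python
def interpolate_envelopes(env_low, pitch_low, env_high, pitch_high, target_pitch):
    """Blend two envelopes sampled on the same frequency grid.

    env_low / env_high are lists of envelope values taken from fragments
    recorded at pitch_low and pitch_high (hypothetical inputs).
    """
    if len(env_low) != len(env_high):
        raise ValueError("envelopes must share one frequency grid")
    if pitch_high <= pitch_low:
        raise ValueError("pitch_high must be greater than pitch_low")
    weight = (target_pitch - pitch_low) / (pitch_high - pitch_low)
    weight = min(1.0, max(0.0, weight))  # clamp, so nothing is extrapolated
    return [(1.0 - weight) * lo + weight * hi
            for lo, hi in zip(env_low, env_high)]


# A target pitch halfway between the two candidates gives an equal blend.
envelope = interpolate_envelopes([0.0, 2.0], 100.0, [4.0, 6.0], 200.0, 150.0)
```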

Adjustment of Deterministic Component

[0065] Adjustment of the deterministic component is
performed by the deterministic component adjusting
means 22.

[0066] In the case of a voiced sound, the deterministic
component contains strength and phase spectral envelope
information, which are the SMS analysis results. In the
case of a plurality of fragments, either the fragment
most ideally suited for the desired control parameter
(such as pitch) is selected, or a spectral envelope
suitable for the desired control parameter is obtained by
performing an operation such as interpolating the
plurality of fragments. In addition, the shape of the
obtained spectral envelope may be further changed
according to another control parameter by a suitable
method.
[0067] Moreover, to decrease harsh noises, or to give
the sound a special characteristic, band pass filtering
may be applied to allow components of a certain frequency
band to pass.

[0068] An unvoiced sound contains no deterministic
component.

Adjustment of Stochastic Component 

[0069] Since the stochastic component from the SMS
analysis of a voiced sound remains influenced by its
original pitch, an attempt to match the sound to another
pitch may result in an unnatural sound. To prevent this,
processing needs to be carried out on low frequency
stochastic components to achieve matching with the
desired pitch. This processing is performed by the
stochastic component adjusting means 23.
[0070] The processing of adjustment of the stochastic 
component will be described with reference to FIGS. 3A 
and 3B. 

[0071] FIG. 3A is an example of an amplitude spectrum of
a stochastic component obtained from an SMS analysis of a
voiced sound. It is difficult to completely remove the
effect of the deterministic component, and as shown in
the figure, there are some peaks in the vicinity of the
harmonics. If this stochastic component is used as it is
to synthesize a voice sound at a pitch different from the
original pitch, peaks will appear in the vicinity of
lower frequency harmonics, which do not blend smoothly
with the deterministic component and are audible as a
harsh sound. To avoid this, the frequency of the
stochastic component may be varied so as to match a
change in pitch. However, since high frequency stochastic
components are less affected by the deterministic
component, it is desirable to use the original amplitude
spectrum as it is in that region. In other words, it
should be sufficient to compress or expand the frequency
axis according to the desired pitch in the low frequency
region. However, the original tone color must not be
changed at this time. Namely, it is necessary that the
general shape of the amplitude spectrum be preserved
while carrying out this processing.
[0072] FIG. 3B shows the results of performing the above
processing. As shown in the figure, three peaks in the
low frequency region have been shifted rightward
according to the pitch. The gaps between peaks in the
mid-frequency region have been made narrower, and peaks
in the high frequency region remain unchanged. The height
of each peak is adjusted to preserve the general shape of
the amplitude spectrum, indicated by a broken line in the
figure.
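
The frequency-axis adjustment described above might be sketched as below. This is only one way to realize it under stated assumptions: the band edges `low_edge` and `high_edge`, the linear blend across the mid band, the piecewise-linear lookups, and the externally supplied smooth `envelope` (the "general shape") are all hypothetical choices; the text does not specify them.

```python
import bisect


def lerp_lookup(xs, ys, x):
    """Piecewise-linear lookup of y at x; xs ascending, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x) - 1  # xs[i] <= x < xs[i + 1]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def source_frequency(f, ratio, low_edge, high_edge):
    """Original frequency whose content should appear at output frequency f."""
    if f <= low_edge:
        return f / ratio              # low band: scaled with the pitch ratio
    if f >= high_edge:
        return f                      # high band: left unchanged
    # Mid band: blend continuously between the two mappings.
    t = (f - low_edge) / (high_edge - low_edge)
    start = low_edge / ratio
    return start + t * (high_edge - start)


def adjust_stochastic(freqs, amps, envelope, ratio, low_edge, high_edge):
    """Warp the amplitude spectrum, then rescale to keep the envelope shape."""
    adjusted = []
    for f, env_here in zip(freqs, envelope):
        src = source_frequency(f, ratio, low_edge, high_edge)
        amp = lerp_lookup(freqs, amps, src)
        env_src = lerp_lookup(freqs, envelope, src)
        adjusted.append(amp * env_here / env_src if env_src > 0 else 0.0)
    return adjusted


# Toy spectrum on a 50 Hz grid with one peak at 100 Hz and a flat envelope.
freqs = [i * 50.0 for i in range(41)]
amps = [1.0] * 41
amps[2] = 5.0
flat = [1.0] * 41
shifted = adjust_stochastic(freqs, amps, flat, 1.5, 500.0, 1000.0)
```

With a pitch ratio of 1.5 the peak that sat at 100 Hz is read back at 150 Hz, while bins above `high_edge` are passed through untouched.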

[0073] In the case of an unvoiced sound, the above
described processing is unnecessary as it is not affected
by the original pitch.

[0074] The stochastic component thus obtained by the
above processing may further be subjected to additional
processing (such as changing the shape of the spectral
envelope) according to a control parameter. Moreover, to
decrease harsh noises, or to give the sound a special
characteristic, band pass filtering may be applied to
allow components of a certain frequency band to pass.

Adjustment of Duration Time 

[0075] In the above described processing, the fragments
are processed with their original length maintained, so
that singing voice synthesis can only be carried out in
fixed timing. Therefore, depending on the desired timing,
it is necessary to change the duration of the fragment as
required. For example, in the case of a phoneme chain,
the fragment length can be made shorter by thinning out
frames within the fragment, or made longer by adding
duplicate frames within the fragment. Moreover, in the
case of a single phoneme (the case of an elongated
sound), the elongated part can be made shorter by using
only some of the frames within the fragment, or made
longer by repeating frames within the fragment.
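
Thinning out and duplicating frames can both be expressed as picking frames at evenly spaced positions. The helper below is a minimal sketch under that assumption; the function name and the even-spacing rule are illustrative, not taken from the text.

```python
def resample_frames(frames, target_count):
    """Return `target_count` frames chosen at evenly spaced positions.

    A smaller target thins frames out; a larger one duplicates frames.
    """
    if target_count <= 0 or not frames:
        return []
    n = len(frames)
    return [frames[min(n - 1, (i * n) // target_count)]
            for i in range(target_count)]


shorter = resample_frames(list("abcdef"), 3)  # every second frame is kept
longer = resample_frames(list("abc"), 6)      # every frame appears twice
```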

[0076] When repeating frames within a fragment of an
elongated sound, it is known that noise at the junction
between frames can be decreased by repeating in a manner
of advancing in one direction, returning in the reverse
direction, and then again advancing in the original
direction (in other words, looping within a fixed
interval or a random interval), rather than repeating in
a single direction. However, in the case where the
stochastic component has been segmented into frames (of
either fixed or variable length) and stored as frequency
domain data, there is a problem when attempting to
synthesize a waveform by repeating frequency domain frame
data in its original format. The reason is that, when
proceeding in the reverse direction, the waveform in the
frame must also be reversed with respect to time. To
generate such a time-reversed waveform from frame data of
the original frequency domain, the phase in the frequency
domain may be reversed before the data is transformed
into the time domain. FIGS. 4A to 4C show this condition.
[0077] FIG. 4A shows an original waveform of a stochastic
component. A stochastic component for an elongated sound
is generated by repeating the interval between t1 and t2,
by first advancing from t1 until t2, proceeding in the
reverse time direction after reaching t2, and then, upon
reaching t1, proceeding in the forward time direction.
As noted previously, the stochastic component has
been segmented into frames of either fixed or variable
length and stored as frequency domain data. To generate a
waveform in the time domain, an inverse FFT is performed
on the frequency domain frame data, and a window function
and overlapping are applied for synthesis of the
waveform. In the case where synthesis is performed by
reading frames in the reverse time direction, if the
frequency domain frame data is transformed as it is into
the time domain, as shown in FIG. 4B, the waveform within
each frame remains unchanged temporally and only the
frame sequence is reversed. This creates discontinuities
in the generated waveform that cause noise and
distortion.
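
The back-and-forth reading order between t1 and t2 can be sketched as an index generator. The flag marking frames read in the reverse direction is an assumption added for illustration: it identifies the frames that would need their waveform reversed in time.

```python
def ping_pong_indices(start, end, count):
    """List (frame_index, reading_backward) pairs looping start..end.

    Reading advances from `start` to `end`, returns toward `start`, then
    advances again, for `count` frames in total.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    order = []
    index, step = start, 1
    for _ in range(count):
        order.append((index, step < 0))
        if index + step > end or index + step < start:
            step = -step  # turn around at either edge of the loop interval
        index += step
    return order


order = ping_pong_indices(0, 2, 7)
```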
[0078] A solution to this problem with generation of a
time domain waveform from frame data is to pre-process
the frame data so that a time-reversed waveform will be
generated.

[0079] Let the original waveform be designated by f(t)
(which, for the sake of simplicity, is assumed to be
infinitely continuous), the time-reversed waveform by
g(t), and their respective Fourier transforms by F(ω) and
G(ω). Then g(t) = f(-t) holds, and since f(t) and g(t)
are both real functions, the following relation is
established:

G(ω) = F(ω)* (where * indicates a complex conjugate)

[0080] When expressed with amplitude and phase, since the
phase of the complex conjugate is reversed in sign, it
follows that all phase spectra of the frequency domain
frame data should be reversed in order to generate a
time-reversed waveform. In this manner, as shown in FIG.
4C, the waveform even within each frame is reversed with
respect to time, and noise and distortion are not
generated.
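
The relation between phase reversal and time reversal can be checked numerically on a single frame. The sketch below uses a naive discrete Fourier transform written with the standard library, which is an illustrative stand-in for the FFT; the frame values are arbitrary. Note that for a finite frame the reversal is circular, so sample 0 stays in place and the remaining samples are mirrored.

```python
import cmath


def dft(samples):
    """Naive discrete Fourier transform of a list of samples."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def reverse_phase(spectrum):
    """Keep each bin's amplitude and negate its phase (complex conjugate)."""
    return [cmath.rect(abs(c), -cmath.phase(c)) for c in spectrum]


frame = [0.0, 1.0, 0.5, -0.25, -1.0, 0.3, 0.8, -0.6]
reversed_frame = [v.real for v in idft(reverse_phase(dft(frame)))]
```

Here `reversed_frame[n]` matches `frame[-n % len(frame)]` up to rounding error.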
[0081] The duration time adjusting means 24 performs the
above described fragment compression (thinning out of
frames), expansion (repeating of frames) and looping (in
the case of elongated sounds). Through such processing,
the duration (or in other words, the length of the frame
string) of each read-out fragment can be adjusted to a
desired length.


Adjustment of Fragment Level 

[0082] Furthermore, noise may be audible if the disparity
between spectral envelope shapes of the deterministic
component and the stochastic component is too large at
the concatenation boundary where one fragment is
connected to another. Performing a smoothing process over
a plurality of frames at their concatenation boundaries
can eliminate this problem.
[0083] This smoothing process will be described with
reference to FIGS. 5 through 7.

[0084] Since differences in tone color and level of
stochastic components at the fragment concatenation
boundary are relatively difficult to hear, the smoothing
process is performed here for deterministic components
only. At this time, to make the data easier to process
and to simplify the calculations, as shown in FIG. 5, a
spectral envelope of a deterministic component is
considered to consist of a gradient component, expressed
by a straight line or exponential function, and a
resonance component, expressed by an exponential or other
function. Here, the strength of the resonance component
is calculated based on the gradient component, and a
spectral envelope is expressed by adding the gradient
component and resonance component. In other words, the
deterministic component is expressed as a function that
describes the spectral envelope using the gradient and
resonance components. Here, the value of the gradient
component, extended up to 0 Hz, is called the gradient
component gain.
[0085] Next, the two fragments of "a-i" and "i-a" as
shown in FIG. 6 are to be concatenated. Because these
individual fragments have been collected from separate
recordings, there is a mismatch in tone color and level
of "i" at the concatenation boundary. As shown in FIG. 6,
this creates a bump in the waveform at the concatenation
boundary, which will be heard as noise. However, the bump
can be eliminated and noise prevented by cross-fading
individual parameters of the gradient and resonance
components, which are included in each fragment, over
several frames centered on and extending before and after
the concatenation boundary.

[0086] As shown in FIG. 7, to cross-fade the parameters,
each fragment parameter is multiplied by a function that
becomes 0.5 at the concatenation boundary, and then the
parameters are added together. The example of FIG. 7
shows the changing strengths of the primary resonance
components of the "a-i" and "i-a" fragments (based on the
gradient component), and how the primary components are
cross-faded.
[0087] In this manner, noise at the concatenation 
boundary between fragments can be avoided by multi- 
plying each parameter (each resonance component, in 
this case) by a cross-fade parameter, and then adding 
them up. 
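
A cross-fade of one parameter over a window of frames centered on the boundary could look like the sketch below. It assumes a linear fade and assumes that both fragments supply a value for every frame of the window (for example by holding their edge value); neither detail is specified in the text.

```python
def crossfade(left_values, right_values):
    """Cross-fade one parameter across a window centered on the boundary.

    The left fragment's weight falls linearly from 1 to 0 while the right
    fragment's weight rises from 0 to 1; both equal 0.5 at the center.
    """
    n = len(left_values)
    if n != len(right_values) or n < 2:
        raise ValueError("need two equally long windows of at least 2 frames")
    faded = []
    for i in range(n):
        right_weight = i / (n - 1)
        faded.append((1.0 - right_weight) * left_values[i]
                     + right_weight * right_values[i])
    return faded


# Resonance strength 1.0 on the "a-i" side and 3.0 on the "i-a" side.
faded = crossfade([1.0] * 5, [3.0] * 5)
```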

[0088] Instead of performing the above described 
cross-fading, the levels of individual deterministic and 
stochastic components of fragments may be adjusted 
so as to make the fragment amplitudes before and after 
the concatenation boundary nearly equal. The level ad- 
justment can be performed by multiplying the amplitude 
of each fragment by either a constant or time-varying 
coefficient. 

[0089] An example of level adjustment will now be de- 
scribed for the case where "a-i" and "i-a" are to be con- 
catenated and synthesized similarly to the above case. 
[0090] Here, the matching of the gain of the gradient
component of each of the fragments will be considered.
[0091] As shown in FIGS. 8A and 8B, first, the difference
between the gain of the actual gradient component of each
of the fragments "a-i" and "i-a" and a gain obtained by
linearly interpolating gain values between the first and
last frames (shown as a dashed line in the figures) of
each fragment is calculated.
[0092] Next, typical samples (of the parameters of the
gradient and resonance components) of each of the "a" and
"i" phonemes are obtained. The "a-i" data of the first
and last frames may be used to obtain these typical
samples, for example.

[0093] Based on these typical samples, a linear
interpolation of the value of the parameter, e.g. gain,
of the gradient component is performed first. Next, by
sequentially adding together the results of the
interpolation and the above calculated gain difference,
as shown in FIG. 8C, the values of the gradient component
parameter of the two fragments will be equal at the
boundary, and therefore, there will be no discontinuity
in the gain of the gradient component. Discontinuities in
other parameters, such as the resonance component, can
also be prevented in a similar manner.
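
The level adjustment of the gradient component gain can be sketched in three steps: take the deviation of the actual gain from a straight line through its first and last frames, interpolate linearly between the typical samples of the two phonemes, and add the deviation back. The gain values and typical samples below are made-up numbers for illustration.

```python
def linear_ramp(start, end, n):
    """n values running linearly from start to end inclusive."""
    if n == 1:
        return [start]
    return [start + (end - start) * i / (n - 1) for i in range(n)]


def adjust_gain(gains, typical_start, typical_end):
    """Re-anchor a fragment's gain track on the typical phoneme samples."""
    n = len(gains)
    baseline = linear_ramp(gains[0], gains[-1], n)
    difference = [g - b for g, b in zip(gains, baseline)]  # zero at both ends
    target = linear_ramp(typical_start, typical_end, n)
    return [t + d for t, d in zip(target, difference)]


typical_a, typical_i = 9.0, 6.0          # hypothetical typical gains
gain_ai = [10.0, 12.0, 9.0, 8.0]         # hypothetical "a-i" gain track
gain_ia = [5.0, 7.0, 9.0]                # hypothetical "i-a" gain track

adjusted_ai = adjust_gain(gain_ai, typical_a, typical_i)
adjusted_ia = adjust_gain(gain_ia, typical_i, typical_a)
```

Because the deviation is zero at the first and last frame of each fragment, both adjusted tracks take the typical "i" value at the boundary, so the gain is continuous there.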

[0094] Alternatively to the above described method, the
level adjustment may be performed, for example, by
transforming deterministic component data into waveform
data and then adjusting the levels in the time domain.

[0095] After the fragment level adjusting means 25
performs the above described smoothing or level adjusting
between fragments, the fragment concatenating means 26
concatenates the fragments.
[0096] Next, the deterministic component generating means
27 generates a harmonic series that corresponds to the
desired pitch, while preserving the obtained
deterministic component spectral envelope, whereby the
actual deterministic component is obtained. By adding the
stochastic component to the actual deterministic
component, a synthesized singing sound is obtained, which
is then transformed into a time domain signal. For
example, in the case where both the deterministic
component and the stochastic component are stored as
frequency components, both components are added together,
and the resulting sum is subjected to an inverse FFT,
windowing and overlapping, whereby a synthesized waveform
is obtained.
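
Generating a harmonic series at the desired pitch while preserving the spectral envelope amounts to sampling the envelope at integer multiples of that pitch. The sketch below assumes the envelope is given as breakpoints that are interpolated linearly; the representation and the function names are illustrative.

```python
import bisect


def envelope_at(env_freqs, env_amps, f):
    """Linearly interpolate the envelope at frequency f (clamped at ends)."""
    if f <= env_freqs[0]:
        return env_amps[0]
    if f >= env_freqs[-1]:
        return env_amps[-1]
    i = bisect.bisect_right(env_freqs, f) - 1
    t = (f - env_freqs[i]) / (env_freqs[i + 1] - env_freqs[i])
    return env_amps[i] + t * (env_amps[i + 1] - env_amps[i])


def harmonics_from_envelope(env_freqs, env_amps, pitch_hz, max_freq_hz):
    """Return (frequency, amplitude) pairs for each harmonic of pitch_hz."""
    harmonics = []
    k = 1
    while k * pitch_hz <= max_freq_hz:
        f = k * pitch_hz
        harmonics.append((f, envelope_at(env_freqs, env_amps, f)))
        k += 1
    return harmonics


# A toy envelope falling linearly from 1.0 at 0 Hz to 0.0 at 2000 Hz.
series = harmonics_from_envelope([0.0, 1000.0, 2000.0], [1.0, 0.5, 0.0],
                                 pitch_hz=250.0, max_freq_hz=1000.0)
```

Changing `pitch_hz` moves the harmonics along the same envelope, which is how the tone color is kept while the pitch changes.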

[0097] It should be noted that the deterministic
component and the stochastic component may each be
subjected separately to an inverse FFT, windowing and
overlapping, and then the thus processed components may
be added together. Moreover, a sine wave corresponding to
each harmonic of the deterministic component may be
generated, which is then added to a stochastic component
obtained by performing an inverse FFT and applying
windowing and overlapping.

[0098] FIGS. 9A and 9B are a functional block diagram
illustrating, in greater detail than FIGS. 2A and 2B, the
configuration of the singing voice synthesizing apparatus
according to the present embodiment. In FIGS. 9A and 9B,
the same elements and parts as in FIGS. 2A and 2B are
designated by identical reference numerals. Moreover, in
the illustrated example, the phoneme (voice fragment)
database 10 contains deterministic components which
include amplitude spectral envelope information thereof
for each frame, and stochastic components which include
amplitude spectral envelope information and phase
spectral envelope information thereof for each frame.
[0099] In FIGS. 9A and 9B, reference numeral 31
designates a lyric-melody separating means that separates
lyric data and melody data from the music score data of a
song for which a singing voice is to be synthesized, and
32 a lyric-to-phonetic code conversion means that
converts the lyric data from the lyric-melody separating
means 31 into a string of phonetically coded data
(phonemes). A phoneme string from the lyric-to-phonetic
code conversion means 32 is input to the phoneme
(phonetic code)-to-fragment conversion means 21. Various
control parameters, such as tempo, may be input to
control the musical performance. Pitch information and
dynamics information, such as dynamic marks, that have
been separated from the music score data by the
lyric-melody separating means 31, and the control
parameters are input to a pitch determining means 33,
which in turn determines the pitch, dynamics, and tempo
of the singing sound. Fragment information from the
phoneme-to-fragment conversion means 21 and information
such as pitch, dynamics, and tempo from the pitch
determining means 33 are fed to a fragment selecting
means 34. The fragment selecting means 34 searches the
voice fragment database (phoneme database) 10 and outputs
the most suitable fragment data. At this time, if no
stored fragment data completely matches the search
conditions, data of one or a plurality of similar
fragments is read out.
[0100] Deterministic component data included in the
fragment data output from the fragment selecting means 34
is fed to the deterministic component adjusting means 22.
In the case where a plurality of fragment data have been
read out by the fragment selecting means 34, a spectral
envelope interpolator 35 within the deterministic
component adjusting means 22 performs interpolation so
that the search conditions are satisfied, and as
necessary, a spectral envelope shaper 36 changes the
shape of the spectral envelope according to the control
parameters.

[0101] On the other hand, stochastic component data
included in the fragment data output from the fragment
selecting means 34 is input to the stochastic component
adjusting means 23. This stochastic component adjusting
means 23 is supplied with pitch information from the
pitch determining means 33, and as was described with
reference to FIGS. 3A and 3B, compresses or expands the
frequency axis for low frequency stochastic components
according to a desired pitch. Namely, a band pass filter
37 divides the amplitude spectrum and phase spectrum of a
stochastic component into the three regions of low
frequency, mid-frequency and high frequency. Frequency
axis compressor-expanders 38 and 39 compress or expand
the frequency axis according to the desired pitch for the
low frequency and mid-frequency regions, respectively.
Low and mid-frequency region signals resulting from the
frequency axis compression or expansion, and a high
frequency region signal based on the high frequency
region for which no frequency axis compression or
expansion has been performed, are fed to a peak adjuster
40 where peak values of these signals are adjusted so as
to preserve the shape of the spectral envelope of this
stochastic component.
[0102] The deterministic component data from the
deterministic component adjusting means 22 and the
stochastic component data from the stochastic component
adjusting means 23 are input to the duration time
adjusting means 24. Then, the duration time adjusting
means 24 changes the time length of the fragment
according to a sounding time length which is determined
by the melody information and the tempo information. As
previously described, in the case where the duration time
of the fragment is to be made shorter, the time axis
compressor-expander 43 performs the process of thinning
out frames, and in the case where the duration time is to
be made longer, a loop section 42 performs the loop
processing described with reference to FIGS. 4A to 4C.

[0103] The fragment data whose duration time has been
adjusted by the duration time adjusting means 24 is
subjected to a level adjusting process by the fragment
level adjusting means 25 as described previously with
reference to FIGS. 5 through 8C, and the deterministic
components and stochastic components of the
level-adjusted fragment data are each concatenated into
respective time series by the fragment concatenating
means 26.

[0104] The deterministic components (spectral envelope
information) of the fragment data concatenated by the
fragment concatenating means 26 are input to the
deterministic component generating means 27. This
deterministic component generating means 27 is supplied
with pitch information from the pitch determining means
33, and based on the spectral envelope information,
generates harmonic components corresponding to the pitch
information, from which the actual deterministic
component for each frame is obtained.
[0105] Next, the adder 28 synthesizes a frequency domain
signal for each frame by combining stochastic component
amplitude and phase spectral envelope information from
the fragment concatenating means 26 with deterministic
component amplitude spectrum information from the
deterministic component generating means 27.

[0106] Then, the frequency domain signal for each frame
thus synthesized is transformed by an inverse Fourier
transform means (inverse FFT means) 51 into a time domain
waveform signal. Next, a windowing means 52 multiplies
the time domain waveform signal by a windowing function
that corresponds to the frame length, and an overlap
means 53 synthesizes a time waveform signal by
overlapping the time domain waveform signals for
respective frames.
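
The chain of inverse transform, windowing and overlapping can be sketched as follows. The naive inverse DFT stands in for the inverse FFT, and the periodic Hann window with a hop of half the frame length is an assumed choice; the text only requires a window matching the frame length.

```python
import cmath
import math


def inverse_dft_real(spectrum):
    """Naive inverse DFT returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def hann_window(n):
    """Periodic Hann window; copies shifted by n / 2 sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(frame_spectra, hop):
    """Transform each frame, window it, and sum the frames `hop` apart."""
    frame_len = len(frame_spectra[0])
    window = hann_window(frame_len)
    output = [0.0] * (hop * (len(frame_spectra) - 1) + frame_len)
    for index, spectrum in enumerate(frame_spectra):
        frame = inverse_dft_real(spectrum)
        start = index * hop
        for i in range(frame_len):
            output[start + i] += frame[i] * window[i]
    return output


# Spectrum of a constant frame of ones (length 8): only the DC bin is set.
constant_frame = [8.0] + [0.0] * 7
signal = overlap_add([constant_frame] * 3, hop=4)
```

With this window and hop, the fully overlapped middle of `signal` reproduces the constant input, which is a quick check that the frames join without amplitude ripple.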
[0107] Then, a D/A conversion means 54 converts the 
thus-synthesized time waveform signal into an analog 
signal that is output via an amplifier 55 to a speaker 56 
to be sounded therefrom. 

[0108] FIG. 10 illustrates an example of the construction
of a hardware apparatus used to operate the specific
example shown in FIGS. 9A and 9B. In the figure, reference
numeral 61 designates a central processing unit (CPU)
that controls the overall operation of the singing voice
synthesizing apparatus, 62 a ROM that stores various
programs, constants and other data, 63 a RAM that stores
a work area and various data, 64 a data memory, 65 a
timer that generates prescribed timer interrupts or the
like, 66 a lyric-melody input unit that inputs music
score, lyric and other data of a song to be performed, 67
a control parameter input unit that inputs various
control parameters related to the performance, 68 a
display that displays various types of information, 69 a
D/A converter that converts the synthesized singing voice
data into an analog signal, 70 an amplifier, 71 a
speaker, and 72 a bus that interconnects all the
above-mentioned component elements.

[0109] The phoneme database 10 is loaded into the 
ROM 62 or the RAM 63. A singing sound is synthesized 
in the above described manner according to the data 
input by the lyric-melody input unit 66 and the control 
parameter input unit 67, and a singing sound is output 
from the speaker 71.

[0110] The construction of the hardware apparatus of FIG.
10 is identical with that of an ordinary general-purpose
computer. The above described functional blocks of the
singing voice synthesizing apparatus of the present
invention may also be realized by an application program
executed by a general-purpose computer.
[0111] In the above described embodiment, the fragment
data stored in the database 10 is SMS data, which is
typically comprised of a spectral envelope of the
deterministic component for each unit time (frame), and
amplitude and phase spectral envelopes of the stochastic
component for each frame. As described above, by storing
fragment data of elongated sounds, such as long vowels, a
high-quality singing sound can be synthesized. However,
especially in the case of elongated sounds, there is the
problem of large data sizes due to the storage of
deterministic and stochastic components for each time
instance (frame) during the interval of the elongated
sound.

[0112] In the case of deterministic components, it is
sufficient to store data for each frequency that is an
integer multiple of the fundamental pitch. For example,
if the fundamental pitch is 150 Hz and the maximum
frequency is 22025 Hz, the amplitude (or phase) data need
be stored only at the integer multiples of 150 Hz up to
that maximum. On the other hand, in the case of
stochastic components, a much larger quantity of data is
required, that is, the amplitude spectral envelope and
phase spectral envelope must be stored for all
frequencies. If 1024 points are sampled within a frame,
the amplitude and phase data for 1024 frequencies are
required. Especially in the case of elongated sounds, the
quantity of data becomes extremely large since data must
be stored for all frames within the interval of the
elongated sound. Moreover, the data of the elongated
sound interval must be provided for each individual
phoneme, and as described above, the data should
desirably be provided for each of various pitches to
increase naturalness, but this leads to a further
increase in the quantity of data in the database.
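
The difference in data volume per frame can be made concrete with the figures given above. The calculation assumes one stored value per harmonic for the deterministic component and one amplitude plus one phase value per sampled frequency for the stochastic component; the variable names are illustrative.

```python
def harmonic_count(pitch_hz, max_freq_hz):
    """Number of integer multiples of pitch_hz not exceeding max_freq_hz."""
    return int(max_freq_hz // pitch_hz)


# Deterministic component: one value per harmonic of 150 Hz up to 22025 Hz.
deterministic_values = harmonic_count(150, 22025)

# Stochastic component: amplitude and phase at each of 1024 frequencies.
stochastic_values = 2 * 1024

ratio = stochastic_values / deterministic_values
```

Under these assumptions the stochastic component needs roughly fourteen times as many values per frame as the deterministic component.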
[0113] Therefore, another embodiment of the present
invention, which enables the size of the database to be 
made extremely small, will be described below. Accord- 
ing to this embodiment, a means is added for whitening 
the spectral envelope when storing stochastic compo- 
nent data of elongated sounds to generate the database 
10. Also, a means for generating a stochastic compo- 
nent spectral envelope during synthesis of a singing 
sound is provided within the stochastic component ad- 
justing means. Thus, the data size can be reduced be- 
cause it is unnecessary to store individual spectral en- 
velopes of the stochastic components of elongated 
sounds. 

[0114] FIG. 11 shows an example of spectral enve- 
lopes of the deterministic and stochastic components of 
an elongated sound. As shown in the figure, in the case 
of an elongated sound, the spectral envelope of the sto- 
chastic component generally resembles that of the de- 
terministic component. Namely, the locations of peaks 
and valleys are roughly aligned. Therefore, a suitable 
stochastic component spectral envelope can be ob- 
tained by performing some arbitrary processing (such 
as gain adjustment, adjustment of the overall gradient, 
etc.) on the spectral envelope of the deterministic com- 
ponent. 

[0115] Moreover, in the case of an elongated sound, each
frequency component in each frame within a certain
interval to be processed has a slight fluctuation that is
important. The degree of this fluctuation is not
considered to change much even when a vowel changes.
Therefore, an amplitude spectral envelope of a stochastic
component is flattened in advance by some means
(whitening) to eliminate the influence of the tone color
of the original vowel. The spectrum appears flat due to
the whitening. Then, at the time of synthesis, a spectral
envelope of the stochastic component is determined based
on the shape of the spectral envelope of the
deterministic component, and the determined stochastic
component spectral envelope is multiplied by the whitened
spectral envelope to obtain an amplitude spectrum of the
stochastic component. In other words, only the spectral
envelope of the stochastic component is generated based
on the deterministic component spectral envelope, while
the phase included in the original stochastic component
of the elongated sound is used as it is. In this manner,
stochastic components of different elongated vowel sound
data can be generated based on whitened elongated sound
data.
[0116] FIG. 12 illustrates a process for generating the
phoneme database 10 according to this embodiment. In the
figure, component elements and parts corresponding to
those in FIG. 1 are designated by identical reference
numerals, description of which is omitted. As shown in
FIG. 12, for elongated sounds, this embodiment has a
spectral whitening means 80 that whitens the amplitude
spectrum of a stochastic component output from the
segmentor 14. Therefore, the only data stored as the
stochastic component of each fragment data of an
elongated sound are the whitened amplitude spectrum and
the phase spectrum.
[0117] FIG. 13 shows an example of the configuration 
of the spectral whitening means 80. 
[0118] As previously noted, the stochastic component
amplitude spectrum of an elongated sound is whitened
by this spectral whitening means 80, and appears flat.
However, at this time, the spectral envelopes of all
frames within an interval for processing are not made
completely flat (i.e. not the same spectral value at all
frequencies). It is important that the small temporal
fluctuations of each frequency be retained while making the
spectral envelope shape in each frame nearly flat. To
this end, as shown in FIG. 13, a typical amplitude spectral
envelope generator 81 generates a typical envelope
of the amplitude spectrum within an interval for processing,
a spectral envelope inverse generator 82 generates
the inverse of each frequency component of the spectral
envelope, and a filter 83 multiplies the output of the
spectral envelope inverse generator 82 by individual
frequency components of the spectral envelope of each
frame.

[0119] Here, a typical envelope of an amplitude spectrum
within the interval may be generated, for example,
by calculating an average value of the amplitude
spectrum for each frequency and using those average
values as the typical spectral envelope. Alternatively,
the maximum value of each frequency component within
the interval may be used as the typical spectral envelope.

[0120] As a result, whitened amplitude spectra can be
obtained from the filter 83. Moreover, the phase spectra 
are stored directly as stochastic component information 
of the fragment. 
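
The whitening carried out by elements 81 to 83 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name whiten, the representation of an interval as a list of per-frame amplitude spectra, and the small eps floor that guards against division by zero are all hypothetical and do not appear in the original text.

```python
def whiten(frames, mode="mean", eps=1e-12):
    """Whiten the amplitude spectra of one interval.

    frames: list of frames, each a list of amplitudes per frequency bin.
    mode:   "mean" uses the per-bin average as the typical envelope,
            "max" uses the per-bin maximum (both options are named in the text).
    eps:    assumed floor to avoid dividing by zero (not in the source).
    """
    n_frames = len(frames)
    n_bins = len(frames[0])
    # Typical amplitude spectral envelope of the interval (generator 81).
    if mode == "mean":
        typical = [sum(f[k] for f in frames) / n_frames for k in range(n_bins)]
    elif mode == "max":
        typical = [max(f[k] for f in frames) for k in range(n_bins)]
    else:
        raise ValueError("mode must be 'mean' or 'max'")
    # Inverse of each frequency component (inverse generator 82).
    inverse = [1.0 / max(t, eps) for t in typical]
    # Multiply every frame by the same inverse envelope (filter 83).
    whitened = [[f[k] * inverse[k] for k in range(n_bins)] for f in frames]
    return whitened, typical


frames = [[4.0, 2.0, 1.0], [6.0, 2.0, 3.0]]
whitened, typical = whiten(frames, mode="mean")
```

Because every frame is divided by the same typical envelope, each frame becomes nearly flat while the frame-to-frame fluctuation of each frequency bin is retained, which is the property that paragraph [0118] calls important.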

[0121] In this manner, the stochastic component of an
elongated sound is whitened, and the spectral envelope
of the deterministic component is used during synthesis
to generate the stochastic component. Therefore, a
whitened stochastic component can be used in common
for all vowels. In other words, in the case of a vowel,
a single whitened stochastic component of an elongated
sound is sufficient. Of course, a plurality of whitened
stochastic components may be provided.

[0122] FIGS. 14A and 14B illustrate a synthesis
process which is executed in the case where the whitened
amplitude spectra of the stochastic components
of elongated sounds are stored in the above described
manner. In the figure, component elements and parts
corresponding to those in FIGS. 2A and 2B are
designated by identical reference numerals, description of
which is omitted. As shown in the figure, according to 
this embodiment, a spectral envelope generating 
means 90, to which are input stochastic components 
(whitened amplitude spectra) of fragments that have 
been read out from the database 10, is added on the 
upstream side of the stochastic component adjusting 
means 23. 

[0123] When the whitened stochastic component of
an elongated sound is read out from the phoneme
database 10, the spectral envelope generating means 90
calculates the amplitude spectral envelope of the
stochastic component based on the spectral envelope of
the deterministic component, as described above. For
example, a method may be considered in which,
assuming that the component at the maximum frequency
does not change, the amplitude spectral envelope of
the stochastic component is determined by changing
only the gradient of the spectral envelope.
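
One way to picture the gradient-only method is the sketch below. It is a hedged illustration rather than the method of the embodiment: the text does not say how the gradient is applied, so a gain that falls linearly in decibels from 0 Hz to the maximum frequency is assumed here, and the names tilt_envelope and gain_db_at_0hz are hypothetical.

```python
def tilt_envelope(det_env, gain_db_at_0hz):
    """Derive a stochastic-component envelope from a deterministic one.

    det_env:        amplitude envelope of the deterministic component,
                    one value per bin from 0 Hz up to the maximum frequency
                    (at least two bins).
    gain_db_at_0hz: assumed tilt control; the gain in dB applied at 0 Hz.
    The gain decreases linearly (in dB) to 0 dB at the top bin, so the
    component at the maximum frequency does not change.
    """
    n = len(det_env)
    out = []
    for k, amplitude in enumerate(det_env):
        weight = 1.0 - k / (n - 1)  # 1.0 at 0 Hz, 0.0 at the maximum frequency
        out.append(amplitude * 10.0 ** (gain_db_at_0hz * weight / 20.0))
    return out


det_env = [1.0, 0.5, 0.25]
sto_env = tilt_envelope(det_env, gain_db_at_0hz=20.0)
```

With a gain of 20 dB at 0 Hz the lowest bin is scaled by a factor of 10, while the highest bin keeps its original value, so only the overall slope of the envelope changes.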
[0124] Then, the determined amplitude spectral en- 
velope, together with the phase spectrum of the sto- 
chastic component that has been read at the same time, 
are input to the stochastic component adjusting means 
23. The subsequent processing is the same as was il- 
lustrated in FIGS. 2A and 2B. 

[0125] As described above, when the amplitude spectra
of stochastic components of elongated sounds are
to be whitened and stored, the whitened amplitude 
spectra of stochastic components of some of the elon- 
gated sounds may be stored, while the amplitude spec- 
tra of stochastic components of the other elongated 
sounds are not stored. 

[0126] In this case, if one of the other elongated 
sounds is to be synthesized, the amplitude spectra of 
the stochastic components of this elongated sound are 
not included in the fragment data of the elongated 
sound. Therefore, a phoneme that most closely resembles
the phoneme to be synthesized is extracted
from the database. Using the stochastic components of
the elongated sound of the extracted phoneme, amplitude
spectra of the stochastic components may be generated
in the above described manner.

[0127] Moreover, phonemes from which elongated
sounds can be generated may be divided into one or
more groups, and using one of the elongated sound data
belonging to the group affiliated with the phoneme to be
synthesized, amplitude spectra of the stochastic
components may be generated in the above described manner.
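
The group-based selection of paragraphs [0126] and [0127] might be sketched as follows. The grouping, the phoneme symbols, and the function name pick_whitened_source are hypothetical and serve only as an example; the text does not specify how groups are formed or how resemblance between phonemes is measured.

```python
# Hypothetical grouping of phonemes; not taken from the source.
GROUPS = {
    "front": ["i", "e"],
    "back": ["a", "o", "u"],
}


def pick_whitened_source(phoneme, stored, groups):
    """Return the phoneme whose stored whitened spectrum should be used.

    stored: mapping from phoneme to its stored whitened amplitude spectra.
    If the phoneme itself is stored, it is used directly. Otherwise the
    first stored member of the same group is used. None means no match.
    """
    if phoneme in stored:
        return phoneme
    for members in groups.values():
        if phoneme in members:
            for candidate in members:
                if candidate in stored:
                    return candidate
    return None


# Only "a" and "i" have whitened spectra in this toy database.
stored = {"a": [[1.0, 1.0]], "i": [[1.0, 1.0]]}
source_for_o = pick_whitened_source("o", stored, GROUPS)
source_for_e = pick_whitened_source("e", stored, GROUPS)
```

Here the elongated sound "o" borrows the whitened data stored for "a", and "e" borrows the data stored for "i", so the database needs one stored entry per group rather than one per vowel.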

[0128] Further, when using the amplitude spectra of
stochastic components obtained from the whitened
amplitude spectra and the amplitude spectra of
deterministic components, all or a part of the frequency
axes of the stochastic component phase spectra are
shifted so that data indicative of harmonics and their
vicinities corresponding to the pitch of the original data
becomes indicative of harmonics and their vicinities
corresponding to the desired pitch at which the sound is
to be reproduced. In other words, a more natural
synthesized sound can be obtained by using the phase data
indicative of harmonics and their vicinities as it is
during synthesis.
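
The frequency-axis shift of the phase spectrum can be illustrated with the sketch below. It is one possible reading under stated assumptions, not the procedure of the embodiment: each output bin is assigned to its nearest harmonic of the desired pitch, and the phase is copied from the bin at the same offset from the corresponding harmonic of the original pitch. The function name shift_phase_axis and its parameters are hypothetical.

```python
import math


def shift_phase_axis(phase, f0_old, f0_new, bin_hz):
    """Remap a phase spectrum so harmonics of f0_old land on harmonics of f0_new.

    phase:  phase value per frequency bin of the original stochastic component.
    f0_old: pitch of the original data in Hz.
    f0_new: desired pitch in Hz.
    bin_hz: frequency spacing between adjacent bins in Hz.
    """
    n = len(phase)
    out = []
    for k in range(n):
        f = k * bin_hz
        # Nearest harmonic number of the desired pitch for this bin.
        h = int(math.floor(f / f0_new + 0.5))
        # Distance of this bin from that harmonic (its "vicinity").
        offset = f - h * f0_new
        # Same harmonic and same offset, but measured at the original pitch.
        src_bin = int(math.floor((h * f0_old + offset) / bin_hz + 0.5))
        src_bin = min(max(src_bin, 0), n - 1)  # clamp to the available bins
        out.append(phase[src_bin])
    return out


# Toy phase spectrum whose value equals its own bin index, for easy tracing.
phase = [float(k) for k in range(10)]
shifted = shift_phase_axis(phase, f0_old=200.0, f0_new=300.0, bin_hz=100.0)
```

In this toy case the original harmonics sit at bins 2, 4 and 6, and after the shift their phase values appear at bins 3, 6 and 9, the harmonic positions of the desired pitch.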

[0129] According to this embodiment, the database 
does not have to store an elongated sound stochastic 
component for every vowel, and therefore the quantity 
of data can be reduced. 

[0130] Furthermore, in the case where the spectral
envelope of the stochastic component is determined by
changing only the gradient of this spectral envelope, the
"degree of huskiness" of the synthesized voice can be
controlled by correlating the change in gradient with
huskiness.

[0131] More specifically, the synthesized voice will be
husky if it contains many stochastic components, and
will be smooth if it contains few stochastic components.
Therefore, if the gradient is steep (the gain at 0 Hz is
large), the voice will be husky, and if the gradient is slight
(the gain at 0 Hz is small), the voice will be smooth.
Therefore, as shown in FIG. 15, the gradient of the spectral
envelope of the stochastic component is controlled
according to a parameter that expresses the degree of
huskiness, to thereby control the huskiness of the
synthesized voice.

[0132] FIG. 16 shows an example of the configuration
of the spectral envelope generating means 90 which is
adapted to control the degree of huskiness. A spectral
envelope generator 91 multiplies the spectral envelope
of the deterministic component by a gradient value that
corresponds to the huskiness information supplied as a
control parameter. A filter 92 adds the characteristics thus
obtained to the whitened amplitude spectrum of the
stochastic component. Then, the phase spectrum
of the stochastic component and the output from the
filter 92 are fed as stochastic component data to the
stochastic component adjusting means 23.
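
The cooperation of the spectral envelope generator 91 and the filter 92 for one frame can be sketched as below. This is an assumption-laden illustration, not the configuration of FIG. 16 itself: the huskiness parameter is assumed to lie between 0 and 1 and to map linearly to a gain in decibels at 0 Hz, and the names stochastic_amplitude and max_gain_db are hypothetical.

```python
def stochastic_amplitude(det_env, whitened, huskiness, max_gain_db=12.0):
    """Compute the stochastic amplitude spectrum of one frame.

    det_env:     deterministic-component envelope, one value per bin
                 (at least two bins).
    whitened:    whitened stochastic amplitude spectrum of the same frame.
    huskiness:   assumed control parameter in the range 0.0 to 1.0.
    max_gain_db: assumed gain at 0 Hz when huskiness is 1.0.
    """
    n = len(det_env)
    gain_db_at_0hz = huskiness * max_gain_db
    out = []
    for k in range(n):
        weight = 1.0 - k / (n - 1)  # tilt is largest at 0 Hz, zero at the top bin
        # Generator 91: deterministic envelope scaled by the gradient value.
        envelope = det_env[k] * 10.0 ** (gain_db_at_0hz * weight / 20.0)
        # Filter 92: impose the envelope on the whitened spectrum.
        out.append(envelope * whitened[k])
    return out


det_env = [1.0, 0.5, 0.25]
whitened = [1.0, 2.0, 1.0]
smooth = stochastic_amplitude(det_env, whitened, huskiness=0.0)
husky = stochastic_amplitude(det_env, whitened, huskiness=1.0, max_gain_db=20.0)
```

The phase spectrum is not touched by this step; it would be passed on unchanged together with the returned amplitude spectrum.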
[0133] It is also possible to model the spectral envelope
of the deterministic component in a suitable manner
and to correlate a parameter of the model with the
degree of huskiness. For example, the spectral envelope
of the stochastic component may also be calculated
by correlating the degree of huskiness with any one
of the parameters (a parameter related to gradient) used in
formularizing the spectral envelope of the deterministic
component, and by changing that parameter.
[0134] Furthermore, the degree of huskiness may be
constant or may be varied over time. In the case of
time-varying huskiness, an interesting effect can be obtained
wherein a voice becomes gradually more husky during
the elongation of a phoneme.
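
A time-varying degree of huskiness could be supplied frame by frame, for instance as a linear ramp. The sketch below is only an example under that assumption; the text does not prescribe any particular trajectory, and the name huskiness_ramp is hypothetical.

```python
def huskiness_ramp(n_frames, start=0.0, end=1.0):
    """Return one huskiness value per frame, rising linearly from start to end."""
    if n_frames == 1:
        return [start]
    step = (end - start) / (n_frames - 1)
    return [start + i * step for i in range(n_frames)]


# Five frames of an elongated phoneme that grows steadily huskier.
ramp = huskiness_ramp(5, start=0.0, end=1.0)
```

Each value in ramp would then serve as the control parameter for the corresponding frame when the spectral envelope of the stochastic component is generated.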

[0135] Moreover, for the sole purpose of controlling
the degree of huskiness, it is unnecessary to store the
whitened amplitude spectrum of a stochastic component
in the phoneme database 10 as described above.
In that case, the amplitude spectrum of the stochastic
component of an elongated sound is stored as it is,
similarly as for other fragments. During synthesis, a
flat spectrum is generated by obtaining a typical
amplitude spectrum within the elongated sound interval,
and multiplying the inverse thereof by the amplitude
spectrum of the stochastic component. Then, based on
the amplitude spectrum of the deterministic component,
the amplitude spectrum of the stochastic component is
calculated according to the parameter that controls the
degree of huskiness. The flat spectrum is then multiplied
by the calculated amplitude spectrum of the stochastic
component to obtain the
amplitude spectrum of the stochastic component.

Claims

1. A singing voice synthesizing apparatus comprising:

a phoneme database that stores a plurality of
voice fragment data formed of voice fragments
each being a single phoneme or a phoneme
chain of at least two concatenated phonemes,
each of the plurality of voice fragment data
comprising data of a deterministic component
and data of a stochastic component;
an input device that inputs lyrics;
a readout device that reads out from said phoneme
database the voice fragment data corresponding
to the inputted lyrics;
a duration time adjusting device that adjusts
time duration of the read-out voice fragment data
so as to match a desired tempo and manner
of singing;

an adjusting device that adjusts the deterministic
component and the stochastic component
of the read-out voice fragment so as to match
a desired pitch; and

a synthesizing device that synthesizes a singing
sound by sequentially concatenating the
voice fragment data that have been adjusted by
said duration time adjusting device and said adjusting
device.

2. A singing voice synthesizing apparatus according
to claim 1, wherein said phoneme database stores
a plurality of voice fragment data having different
musical expressions for a single phoneme or phoneme
chain.

3. A singing voice synthesizing apparatus according
to claim 2, wherein said musical expressions include
at least one parameter selected from the
group consisting of pitch, dynamics and tempo.

4. A singing voice synthesizing apparatus according
to claim 1, wherein said phoneme database stores
voice fragment data comprising elongated sounds
that are each enunciated by elongating a single
phoneme, voice fragment data comprising
consonant-to-vowel phoneme chains and
vowel-to-consonant [...]

5. A singing voice synthesizing apparatus according
to claim 1, wherein each of said voice fragment data
comprises a plurality of data corresponding
respectively to a plurality of frames of a frame string formed
[...] a corresponding one of the voice
fragments, and wherein the data of the deterministic
component and the data of the stochastic component
of each of said voice fragment data each comprise
a series of frequency domain data corresponding
respectively to the plurality of frames of
the frame string corresponding to each of the voice
fragments.

6. A singing voice synthesizing apparatus according
to claim 5, wherein said duration time adjusting device
generates a frame string of a desired time
length by repeating at least one frame of the plurality
of frames of the frame string corresponding to
each of the voice fragments, or by thinning out a
predetermined number of frames of the plurality of
frames of the frame string corresponding to each of
the voice fragments.

7. A singing voice synthesizing apparatus according
to claim 6, wherein said duration time adjusting device
generates the frame string of a desired time
length by repeating a plurality of frames of the frame
string corresponding to each of the voice fragments,
the duration time adjusting device repeating the
plurality of frames in a first direction in which the
frame string of a desired time length is generated
and in a second direction opposite thereto.

8. A singing voice synthesizing apparatus according
to claim 7, wherein when repeating the plurality of
frames of the frame string corresponding to the data
of the stochastic component of each of the voice
fragments in the first and second directions, said
duration time adjusting device reverses a phase of
a phase spectrum of the stochastic component.

9. A singing voice synthesizing apparatus according
to claim [...], further comprising a fragment
adjusting device that performs smoothing processing
or level adjusting processing on the deterministic
component and the stochastic component contained
in each of the voice fragment data when the
voice fragment data are sequentially concatenated
by said synthesizing device.

10. A singing voice synthesizing apparatus according
to claim 5, further comprising a deterministic component
generating device that changes only pitch
of the deterministic component to a desired pitch
while preserving the spectral envelope shape of the
deterministic component contained in each of the
voice fragment data when the voice fragment data
are sequentially concatenated by said synthesizing
device.

11. A singing voice synthesizing apparatus according
to claim 5, wherein said phoneme database stores
voice fragment data comprising elongated sounds
that are each enunciated by elongating a single
phoneme, said phoneme database further storing a
flat spectrum as an amplitude spectrum of the stochastic
component of each of the voice fragment
data comprising each of the elongated sounds, obtained
by multiplying the amplitude spectrum thereof
by an inverse of a typical spectrum within an interval
of the elongated sound.

12. A singing voice synthesizing apparatus according
to claim 11, wherein the amplitude spectrum of the
stochastic component of each of the voice fragment
data comprising each of the elongated sounds is
obtained by multiplying an amplitude spectrum of
the stochastic component calculated based on an
amplitude spectrum of the deterministic component
of the voice fragment data of the elongated sound,
by the flat spectrum.

13. A singing voice synthesizing apparatus according
to claim 12, wherein said phoneme database does
not store amplitude spectra of stochastic components
of voice fragment data comprising certain
elongated sounds, and the flat spectrum stored as
an amplitude spectrum of voice fragment data comprising
at least one other elongated sound is used
for synthesis of the certain sounds.

14. A singing voice synthesizing apparatus according
to claim 12, wherein the amplitude spectrum of the
stochastic component calculated based on the amplitude
spectrum of the deterministic component
has a gain thereof at 0 Hz controlled according to a
parameter for controlling a degree of huskiness.

15. A singing voice synthesizing method comprising the
steps of:

storing in a phoneme database a plurality of
voice fragment data formed of voice fragments
each being a single phoneme or a phoneme
chain of at least two concatenated phonemes,
each of said plurality of voice fragment data
comprising data of a deterministic component
and data of a stochastic component;
reading out from said phoneme database the
voice fragment data corresponding to lyrics inputted
by an input device;
adjusting time duration of the read-out voice
fragment data so as to match a desired tempo
and manner of singing;

adjusting the deterministic component and the
stochastic component of the read-out voice
fragment so as to match a desired pitch; and
synthesizing a singing sound by sequentially
concatenating the voice fragment data that
have been adjusted in respect of the time duration
and the deterministic component and the
stochastic component thereof.

16. A program for causing a computer to execute a 
singing voice synthesizing method comprising the 
steps of: 

storing in a phoneme database a plurality of 
voice fragment data formed of voice fragments 
each being a single phoneme or a phoneme 
chain of at least two concatenated phonemes, 
each of said plurality of voice fragment data 
comprising data of a deterministic component 
and data of a stochastic component; 
reading out from said phoneme database the 
voice fragment data corresponding to lyrics in- 
putted by an input device; 
adjusting time duration of the read-out voice 
fragment data so as to match a desired tempo 
and manner of singing; 

adjusting the deterministic component and the 
stochastic component of the read-out voice 
fragment so as to match a desired pitch; and 
synthesizing a singing sound by sequentially 
concatenating the voice fragment data that 
have been adjusted in respect of the time du- 
ration and the deterministic component and the 
stochastic component thereof. 

17. A mechanically readable storage medium storing 
instructions for causing a machine to execute a 
singing voice synthesizing method comprising the 
steps of: 

storing in a phoneme database a plurality of 
voice fragment data formed of voice fragments 
each being a single phoneme or a phoneme 
chain of at least two concatenated phonemes, 
each of said plurality of voice fragment data 
comprising data of a deterministic component
and data of a stochastic component;
reading out from said phoneme database the
voice fragment data corresponding to lyrics inputted
by an input device;
adjusting time duration of the read-out voice
fragment data so as to match a desired tempo
and manner of singing;

adjusting the deterministic component and the
stochastic component of the read-out voice
fragment so as to match a desired pitch; and
synthesizing a singing sound by sequentially
concatenating the voice fragment data that
have been adjusted in respect of the time duration
and the deterministic component and the
stochastic component thereof.